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Abstract 

Background: Mendelian disorders are mostly caused by single mutations in the DNA sequence of a gene, leading 
to a phenotype with pathologic consequences. Whole Exome Sequencing of patients can be a cost-effective 
alternative to standard genetic screenings to find causative mutations of genetic diseases, especially when the 
number of cases is limited. Analyzing exome sequencing data requires specific expertise, high computational 
resources and a reference variant database to identify pathogenic variants. 

Results: We developed a database of variations collected from patients with Mendelian disorders, which is 
automatically populated thanks to an associated exome-sequencing pipeline. The pipeline is able to automatically 
identify, annotate and store insertions, deletions and mutations in the database. The resource is freely available 
online http://exome.tigem.it. The exome sequencing pipeline automates the analysis workflow (quality control and 
read trimming, mapping on reference genome, post-alignment processing, variation calling and annotation) using 
state-of-the-art software tools. The exome-sequencing pipeline has been designed to run on a computing cluster 
in order to analyse several samples simultaneously. The detected variants are annotated by the pipeline not only 
with the standard variant annotations (e.g. allele frequency in the general population, the predicted effect on gene 
product activity, etc.) but, more importantly, with allele frequencies across samples progressively collected in the 
database itself, stratified by Mendelian disorder. 

Conclusions: We aim at providing a resource for the genetic disease community to automatically analyse whole 
exome-sequencing samples with a standard and uniform analysis pipeline, thus collecting variant allele frequencies 
by disorder. This resource may become a valuable tool to help dissecting the genotype underlying the disease 
phenotype through an improved selection of putative patient-specific causative or phenotype-associated variations. 



Background 

Mendelian disorders are inherited diseases caused by 
inborn defects in the DNA sequence of one or few genes. 
Most inherited genetic disorders are rare, although if 
taken collectively, they are estimated to affect -4% of 
newborns. There are -7000 disease phenotypes described 
in the Online Mendelian Inheritance in Man (OMIM) 
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Database [1] but the cause of about half of the described 
diseases is still unknown [2]. Whole Exome Sequencing 
(WES) of patients allows to find causative mutations of 
genetic diseases thanks to High-Throughput Sequencing 
(HTS) technologies [3]. WES is an effective alternative to 
standard genetic screenings to find causative mutations 
of genetic diseases when only few patients are available, 
as it is often the case for Mendelian disorders [4]. When 
compared to Whole Genome Sequencing (WGS), "WES is 
still to be preferred because the targeted region com- 
prises only 1-2% of the genome sequence and thus much 
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less reads are required to get the sequencing depth neces- 
sary to reliably identify mutations. Furthermore, the 
potentially damaging effect of a coding-region mutation 
on the gene product activity can be predicted with good 
accuracy [5-10], but this is much more difficult in the 
case of a non-coding region mutation [11,12]. 

WES has been successfully used to find candidate cau- 
sative mutations with as low as one affected individual 
[13-18]. One limitation of WES is that the percentage of 
samples where a candidate causative mutation is not 
found is still high [19]. This may happen when the causa- 
tive mutation lies outside the targeted region or in a posi- 
tion difficult to sequence, or may be due to incomplete 
penetrance and the presence of modifier genes [20,21]. 
Another factor affecting the outcome of the analysis is 
the bioinformatic analysis pipeline [22] and its stringency 
level, since no standard operating procedure is currently 
available. This means that in order to compare results of 
different WES samples, it is important to use a uniform 
analysis pipeline and a common reference databases to 
prioritise the detected variants. 

Indeed, despite the ever decreasing cost of sequencing 
experiments, the bioinformatic analysis of WES data 
requires high computational resources, trained experts 
and a reference variant database to select and prioritise 
the best candidate pathogenic variants. 

Our aim was to build a community-based resource 
providing a disease-oriented allele variant frequency 
repository for Mendelian disorders populated by means 
of an automatic exome-sequencing analysis pipeline. 
The expansion and usefulness of this resource will be 
driven by user-submitted WES samples collected from 
Mendelian disorder patients. 

Implementation 

Website 

The website is implemented in PHP. After user registra- 
tion, a new analysis can be started through the Create 
New submission page (Figure 1). The user has to pro- 
vide the presumptive (or known) Mendelian disorder 
associated to the sample, the mode of inheritance and 
the platform used for exome target enrichment. The dis- 
ease has to be chosen using a fixed vocabulary imple- 
menting the MEDIC hierarchical disease ontology [23] 
including all child terms to MeSH ID D009358: 
"Congenital, Hereditary, and Neonatal Dis- 
eases and Abnormalities". The disease list can be 
searched by directly typing the specific OMIM ID [1] or 
a keyword and the auto-completion function will auto- 
matically retreive all the available matching terms. The 
user should choose the definition that best describes the 
patient phenotype. The disease association can be later 
edited, for example when an initially presumptive diag- 
nosis is then confirmed following the WES analysis. In 



such cases, the user will initially choose a less specific 
disease definition, using the controlled vocabulary, and 
can then change it to a more specific one after receiving 
the analysis results. Ideally, the user should confirm the 
diagnosis only after having validated the mutations 
found. The user can submit multiple samples at once, if 
the samples correspond to related individuals. Each 
sample has to be uploaded as a pair of sequence files in 
FastQ format [24]. The user can follow the analysis pro- 
gression online and retrieve the results upon analysis 
completion (Figure 1). 

Pipeline Implementation 

The analysis pipeline is fully automated and it has a 
modular structure, as detailed below and in Additional 
file 1. Each module performs its task using custom 
scripts and state-of-the-art tools (Additional file 2). The 
pipeline was designed to run on a high-performance 
computing cluster using the Torque resource manager, 
but can easily be ported to any other job manager. The 
exome.tigem.it website uses a cluster with 8 computing 
nodes equipped with dual Xeon E5-2670 for a total 
amount of 128 computing cores and 376GB of RAM. 
Read quality assessment and trimming module 
Read sequences are submitted by the user in FastQ for- 
mat [24] and are initially assessed for the general quality 
using FastQC [25]. Reads are then trimmed to remove 
the Illumina adapter sequence and low quality ends 
(with quality score threshold of 20) using Trim Galore 
[26] and cutadapt [27]; a FastQC report is generated 
also on the trimmed sequences. 
Alignment on reference, post-alignment processing and 
summary statistics Modules 

Paired sequencing reads are aligned to the reference gen- 
ome (UCSC, hgl9 build) [28] using BWA [29]. Post- 
alignment process, including SAM conversion, sorting 
and duplicate removal are performed using Picard [30] 
and SAMtools [31]. The Genome Analysis Toolkit 
(GATK) [32] is then used to prepare the raw alignment 
for the variation calling with local realignment around 
small insertions-deletions (INDELs) and Base Quality 
Score Recalibration. This module is followed by a small 
module computing the read summary, target enrichment 
and target coverage statistics with SAMtools and BED- 
Tools [33]. 

SNVs and INDELs calling and annotation Module 

The identification of Single Nucleotide Variants (SNVs) 
and INDELs are separately performed using GATK Uni- 
fiedGenotyper, followed by Variant Quality Score Recali- 
bration [34] when applicable. The SNV and INDEL calls 
are then merged and annotated using ANNOVAR [35] to 
add the following information: the position in genes and 
amino acid change relative to the RefSeq gene model 
[36], presence in dbSNP [37], OMIM [1], frequency in 
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Figure 1 Website interface.Screenshots of the relevant sections of the web interface and outline of the user experience. The analysis is 
submitted through the Create New section, where the user submit the required analysis details. The user can choose to submit a single sample 
or multiple samples of related individuals and an auto-completion feature helps the user in assigning the correct disease id to the analysis. The 
user can follow the progress of the analysis in the Analysis section where he will also find a link to the Results page after the analysis is 
completed. 



NHLBI Exome Variant Server [38] and 1000 Genomes 
Project stratified by population [39], prediction of the 
potential damaging effect on protein activity with differ- 
ent algorithms [5-10] and evolutionary conservation 
scores [40,41]. The annotated results are then imported 
into the variation database. 

Variation database and report generation module 

The variation database is implemented in PostgreSQL 
and its structure with the main tables and relationships is 
shown in Additional file 3. A variations table contains an 
entry for each variation progressively collected in the 
database, each uniquely identified by genomic coordi- 
nates, reference and alternative alleles. Separate tables 
collect the statistics of the analysis calls, the annotation, 
the analysis and samples details. Finally, the diseases 
table contains the MEDIC hierarchical disease terms 
[23]. Once all the detected variants have been imported, 
the report generation module creates a report including 
all the variations found in the samples accompanied by 
the available annotations. Importantly, this module also 
dynamically computes allele frequencies stratified by dis- 
ease groups, using the hierarchical disease ontology. In 
this way, even if no or few samples are available in the 
database for a specific Mendelian disorder, a sufficient 
number of samples can be reached by grouping samples 
at the higher levels of the disease ontology. The variation 
reports of all the archived analysis are periodically 
refreshed to update allele frequencies on the analyses gra- 
dually added to the database. 



Results and discussion 

We developed a variation database for Mendelian disor- 
ders and associated WES analysis pipeline, in order anno- 
tate and store insertions, deletions and single nucleotide 
variants found in targeted resequencing projects, with a 
focus on patients affected by Mendelian disorders. The 
pipeline automates the analysis workflow using state-of- 
the-art tools, starting with raw sequences and providing 
the final list of annotated variants found in the sample. 
The pipeline allows for the simultaneous analysis of mul- 
tiple samples of related individuals. This option is recom- 
mended when analysing members of the same family, 
who are expected to share the same causative mutation. 
In this case, the variant calling algorithm uses a multi- 
sample model that takes into account the global allele 
count in calling the individual genotypes, which can 
highly improve sensitivity [34]. It is also possible to ana- 
lyse unaffected members of the family indicating them as 
controls. In this case the variants called in the unaffected 
members can be directly used to filter out all shared 
mutations that are not relevant in causing the proband 
phenotype. 

This resource is complementary to free and commer- 
cial databases of known mutations associated to specific 
diseases or phenotype, such as the HGMD [42] or the 
ClinVar [43] databases or locus specific databases 
(LSDBs) [44], since it focuses on patients affected by 
Mendelian disorder. It is also different from the other 
large scale databases providing population frequencies 
because the collected samples are not phenotypically 
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normal. Moreover, the associated WES analysis pipelines 
here presented has to be considered only as an accompa- 
nying tool to uniformly populate the database and cannot 
be considered a general purpose exome analysis pipeline, 
such as those recently presented in the literature [45-47]. 

The aim of this resource is to provide a standardised 
analysis of WES samples by providing state-ofthe-art 
pipeline and a standardised output of the variant calls 
and annotations, including the relative allele frequency 
in the anonymised samples already analysed in the data- 
base, stratified by disease. 

Uniformity of the calling quality is ensured by analysing 
all samples with the same pipeline. The analysis was 
implemented to have a low stringency for the initial var- 
iant calling, in order to minimise the false negatives, but it 
relies heavily on intersection filters for controls and gen- 
eral population frequency to rule out non-causative 
mutations. 

Submission of whole exome sequencing samples 

Whole exome sequencing samples are submitted 
through a webpage http://exome.tigem.it shown in 
Figure 1. The user has to provide the required informa- 
tion about the analysis and the samples to be analysed 
and upload the sequences (in FastQ [24] format). Sam- 
ples are required to be annotated with OMIM ID or, if 
a clear diagnosis is not available, with a MeSH term 
[48]. The analysis pipeline uses this annotation to group 
samples by disease and to calculate allele frequencies 
within disease groups (see Implementation). The analy- 
sis can be run on multiple samples provided they are 



Gene Related Annotation 

Symbol Exonic Func Gene Func Aachange 

TNP03 SloplOSSSNV exonic NM_001191028:c.2579delA:p X860C 



from the same family and associated to the same disease 
(or associated controls, e.g. unaffected relatives). The 
user can check the analysis progress through the Analy- 
sis section where all the submitted analyses are archived. 
In the same section the Results page becomes available 
after the analysis was successfully completed. The 
Results page includes the files produces at several steps: 
the quality reports, the processed alignment in BAM 
format [31], reads and target coverage statics, the com- 
plete call results in vcf format [49] and the annotated 
table of variants (Figure 2). The user will find on the 
website notification of every annotation database 
updated or a major analysis pipeline improvement and 
can choose to download updated results. Importantly, 
the sequence data (i.e. FastQ and BAM files) will never 
be made public, and on request these files will be 
deleted from the servers (as specified in the online User 
agreement). In this case, however, the user will not be 
able to get updated results. 

Automated analysis workflow 

As detailed in the Implementation section, the pipeline 
workflow follows a state-of-the-art implementation of 
the exome sequencing analysis [50] (Additional file 1). 
The analysis is initialized by a master script that config- 
ures and submits the modules performing the actual 
analysis steps on the computing cluster. The modules 
are configured with pre-defined sets of parameters to 
ensure uniformity of sensitivity across analyses. The 
user can only choose the number of samples to analyze, 
either as a single case or as a group analysis by selecting 



General Variation Information 
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Variation Frequency From Exome Sequencing Project (ESP) & 1000 Genome Project 
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dbSNPAnnotation 

dbsnpl37 dbSNP137 NonFlagged dbSNP137 Observed Allele OMIM 



Disease Group Allele Frequency From TIGEM Variant DB 

freq Disease ID:X freq Disease ID:X Controls freq Disease ID:1 freq Disease ID .. freq Disease ID:N 



Figure 2 Sample analysis report. The analysis report produced by the analysis pipeline includes these fields for each variant called. In addition 
to the standard predictions of the variant on protein function, the report includes variant frequencies across patients grouped by disease 
according to the MEDIC hierarchical disease ontology. For a description of each field please refer to the Additional file 4. 
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the Family option. In this latter case, also control sam- 
ples are allowed, but these are analyzed separately. 

The first module in the pipeline performs a quality 
assessment of reads and trimming of read ends to remove 
the adapter sequence or trailing low quality bases. Then 
reads are aligned to the reference genome (UCSC hgl9 
[28]) and the alignment is prepared for variation calling 
trough a series of steps: format conversion, sorting, local 
realignment around INDELs and Base Quality Score 
Recalibration. The local realignment around INDELs is 
an important step. It finds a consensus alignment among 
all the reads spanning a deletion or an insertion to both 
improve INDEL detection sensitivity and accuracy and to 
reduce SNV false calls due to misalignment of the flank- 
ing bases. The Base Quality Score Recalibration is a pro- 
cedure through which the raw quality scores provided by 
the instrument are recalibrated according an empirical 
error model derived by the sequences [34]. The SNV and 
INDEL variant calling are then performed and the calls 
are merged and annotated with information collected 
from several sources (Figure 2). The pipeline is designed 
to run on a cluster and can submit jobs in parallel to ana- 
lyse several samples simultaneously. The annotated var- 
iant calls are then imported into the variant database. 

Variant annotation and reporting 

The variation database is used to store the annotated 
exonic/splicing variants and to calculate allele frequen- 
cies stratified by groups of patients presenting the same, 
or similar, disease or phenotype according to the OMIM 
identifiers and MeSH terms, implementing the MEDIC 
hierarchical disease ontology [23]. Importantly, the 
internal allele frequency among samples progressively 
collected in the database itself, stratified by Mendelian 
disorder, is estimated, thus leading to a better selection 
of putative disease-specific causative variations. 

The database includes also annotations of variants 
from external sources (e.g. dbSNP, lOOOgenomes, 
Exome Variant Server and prediction algorithms), which 
are stored in a separate table and are periodically 
updated upon release of a new version of one or more 
external source database. 

The final report of the analysis, which will be available to 
the user, is a Microsoft Excel file including a table with all 
the relevant information useful to filter the selected var- 
iants and to prioritise them in order to choose the best 
possible candidates for subsequent validation (Figure 2). 
Specifically, in order to help the user in the filtering pro- 
cess, the table classifies variants in five classes, as shown in 
Table 1, on the basis of three factors: frequency in the gen- 
eral population, in unrelated diseases, and in the same or 
related disease(s), quality of the call and predicted impact 
on the gene product activity. 



Table 1 Variation Classification 



Variation Class 


Class 


Frequency 


Quality 


Impact 


1 


+ 


+ 


+ 


II 


+ 


+ 




III 


+ 




+ 


IV 


+ 






V 




+/- 


+/- 



Variants are automatically classified by the pipeline to help the user in 
detecting causative mutations (also refer to Additional file 4). A "+" sign 
means that the criterion indicated in the column is satisfied. Frequency 
criterion: the frequency in 1000 Genomes Project, Exome Variant Server and 
the TIGEM Variant database (in unrelated disease groups} must be < 1%; 
Quality criterion: the GATK variant calling tool must assign a "PASS" value to 
the "filter" field and score > 50 to the Genotype Quality field. Moreover, if the 
variant is homozygous, the percentage of reads supporting the call must be > 
80%. If the variant is heterozygous, the percentage of reads supporting the 
call must be > 30% and < 80%): Impact criterion: the mutation causes a gain 
or loss of a stop codon, a gain or loss of a splicing signal, or a frame-shift in 
the Open Reading Frame. Alternatively, the phylogenetic conservation must 
be significant (i.e. the fields UB PhyloP Pred="C" AND LIB Gerp++>2 AND 
Conserved>170) and 4 out of 5 prediction algorithms indicate a damaging 
effect (i.e. Avsift, UB SIFT, UB PolyPhen2, UB LRT, UB Mutation Taster). 

We give priority to the frequency criterion since when 
dealing with rare Mendelian disorders it is unlikely that 
the causative mutation may be common in the general 
population. These categories should be regarded as guides 
in prioritising the variant called in the analysis and can 
help in quickly highlighting the best candidate(s). 

Conclusion 

We developed a resource for the analysis of WES sam- 
ples for researchers studying Mendelian disorders. We 
believe this resource will be useful not only for those 
who do not have the hardware resources or the neces- 
sary expertise to run the analysis, but, more importantly, 
as a common reference for the community to collect 
and compare variants across patients with the same, or 
similar, disease. 

Each researcher by submitting data to the resource will 
enrich the database and thus leverage the frequency of the 
variations potentially associated to the Mendelian diseases. 
For this reason, we require all samples to be annotated 
with the OMIM/MeSH corresponding to the patient phe- 
notype in order to update the corresponding group allele 
frequencies with the new samples variant calls. 

The analysis report classifies variation by classes to 
help the user in prioritising candidate mutants. These 
classes should be regarded as prioritising guides and not 
as hard filters because it is possible that low-quality calls 
(e.g. due to low coverage or other technical problems in 
the regions) are true mutations that can be validated 
and could be lost in a highly stringent analysis. 

The resource provides variant frequencies according 
to disease groups, thus helping in detecting modifier or 
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secondary mutations which tend to be more represented 
in the patients affected by the same phenotype. The esti- 
mation of statistically significant associations will 
improve with the number of patients with homogeneous 
phenotype collected in the resource. 

The TIGEM Exome Mendelian Disorder Pipeline is a 
new community-based resource available to the Mende- 
lian diseases research community, built with the aim of 
help in dissecting the genotype underlying the disease phe- 
notype in patients affected by rare diseases. 

Availability and requirements 

♦ Project name: TIGEM Exome Mendelian Disorder 
Pipeline 

♦ Project home page: http://exome.tigem.it 

♦ Operating system(s): Platform independent 

♦ Programming language: bash, perl, R, SQL, PHP 

♦ License: Terms of use are on the website 

Additional material 
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BAM: Binary Alignment Map; GATK: Genome Analysis Toolkit; HTS: High- 
Throughput Sequencing; INDEL: small insertion or deletion; NGS: Next 
Generation Sequencing; SNP: Single Nucleotide Polymorphism; SNV: Single 
Nucleotide Variation; WES: Whole Exome Sequencing; WGS: Whole Genome 
Sequencing; VCF: Variant Call Format. 
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